Evidence of a genetically driven metabolomic signature in actively inflamed Crohn’s disease

Crohn’s disease (CD) is characterised by chronic inflammation. We aimed to identify a relationship between plasma inflammatory metabolomic signature and genomic data in CD using blood plasma metabolic profiles. Proton NMR spectroscopy were achieved for 228 paediatric CD patients. Regression (OPLS) modelling and machine learning (ML) approaches were independently applied to establish the metabolic inflammatory signature, which was correlated against gene-level pathogenicity scores generated for all patients and functional enrichment was analysed. OPLS modelling of metabolomic spectra from unfasted patients revealed distinctive shifts in plasma metabolites corresponding to regions of the spectrum assigned to N-acetyl glycoprotein, glycerol and phenylalanine that were highly correlated (R2 = 0.62) with C-reactive protein levels. The same metabolomic signature was independently identified using ML to predict patient inflammation status. Correlation of the individual peaks comprising this metabolomic signature of inflammation with pathogenic burden across 15,854 unselected genes identified significant enrichment for genes functioning within ‘intrinsic component of membrane’ (p = 0.003) and ‘inflammatory bowel disease (IBD)’ (p = 0.003). The seven genes contributing IBD enrichment are critical regulators of pro-inflammatory signaling. Overall, a metabolomic signature of inflammation can be detected from blood plasma in CD. This signal is correlated with pathogenic mutation in pro-inflammatory immune response genes.

Crohn's disease (CD), one of the major subtypes of inflammatory bowel disease (IBD), is a heterogenous, relapsing, remitting condition characterised by transmural inflammation across the gastrointestinal tract. Disease aetiology centres on complex interaction between genetic predispostion and intestinal microbial exposure. Over 240 genes associated with IBD are enriched for proteins linked with bacterial recognition and response pathways, epithelial barrier integrity and downstream inflammatory signalling 1,2 . Whilst effective therapies exist, there is a clear need to stratify patients into risk groups for disease severity, complications and medication response. Reliable genetic and plasma biomarkers provide an attractive mechanism to stratify patients at diagnosis and during follow-up, whilst promoting novel drug discovery 3 .
Nuclear magnetic resonance (NMR) spectroscopy identifies precise constituents of biological samples, whereby molecules present distinct characteristic spectra. NMR has demonstrated the ability to discriminate IBD patients from controls through identification of dysregulated urine and plasma metabolites 4 ; and distinguished IBD patients with active disease from those in remission 5 . Identification of genomic variation associated with disease severity markers, or biomarker profiles, can lead to targeted therapeutics and repurposing of known medications for new conditions 6 . Combining urine NMR spectra analysis with common variants identified through genome-wide association studies has previously been used to discover genetically determined metabolites in unselected samples 7 . Genomic data processing. Whole exome sequencing data were generated using Agilent SureSelect exon capture kits and Illumina HiSeq sequencing platforms. Processing and targeted analyses of the whole exome sequencing data applied herein have been presented elsewhere 9,10 . Genomic data were transformed into perpatient gene pathogenicity scores using the GenePy algorithm 11 . GenePy integrates the effect of multiple variants in each gene incorporating information on variant zygosity, frequency and deleteriousness (inferred using CADD v1.5 scores 12 ). GenePy scores were initially generated for all patients for all 19,229 RefSeq genes. Genes with a Gene Damage Index (GDI) above the recommended threshold (GDI_Phred > 13.84) were excluded as genes with values above this level are considered highly mutable but unlikely to be disease causing 13 . This resulted in a final matrix of 15,854 GenePy scores for all patients.
Metabolomics analysis of plasma. Plasma samples (200 µL) were mixed with deuterium water (D 2 O) (400 µL). The homogenized samples were centrifuged (10 min; 4 °C; 12,000 × g) and transferred to 5 mm NMR tubes for analysis by NMR spectroscopy. Plasma samples were processed into a single batch of 228 CD samples. NMR experiments applied a Bruker AV700 NMR instrument equipped with a 5 mm inverse CryoProbe ™ . A standard 1-dimensional NOESY-PR-1D experiment was performed on each sample, using a standard preset pulse sequence (noesypr1d90°). A Carr-Purcell-Meiboom-Gill (CPMG) experiment was applied (preset pulse sequence cpmgpr1d90°), where simple presaturation of the water signal was used. This experiment reduces the signal contribution from albumin and lipoproteins present in plasma and highlight signals from otherwise overshadowed smaller molecules. All samples were analysed at 297° K, 65 k data point spectrum (spectral width 19,607 Hz) was obtained by recording 256 scans (8 dummy scans). Phase and baseline of the spectra were corrected using MestreNova software v10.0m. NMR spectra were referenced to the glucose peak at δ 5.223 ppm.

Metabolomics statistical analysis.
Full resolution spectra were processed using Matlab vR2017a. The residual water signal was removed. Relative spectra were mean-centred and scaled to unit variance. Principal component analysis (PCA) was used to compare samples and identify outliers. Orthogonal Projection to Latent Structure (OPLS) analysis was performed for the supervised stage of the analysis, where NMR spectra were used as a matrix of variables. Regression of continuous patient CRP measurements against their metabolome data matrix, assessed plasma metabolic profile alteration with active inflammation. Model prediction was evaluated using goodness-of-fit correlation coefficient R 2 , showing what percentage of variation is explained by the model, and goodness-of-prediction (Q 2 ), constituting the percentage of that variance which can be predicted by sevenfold cross-validation (hence splitting the input data in 7 subsets and recursively fit the model on 6 subsets and test its performance on 1 the left-out subset until all subset are used as test-set). Loadings were presented as a pseudo-NMR spectrum, plotting the model back-scaled coefficients and the weight of the variables. Metabolites with an R 2 weight > 0.4 were considered highly discriminatory 14 . Machine learning classification. A random forest classifier (RF) of metabolic profiles was employed to predict patients with active inflammation as measured by CRP levels. While the metabolomic analysis utilised www.nature.com/scientificreports/ continuous CRP values to identify highly correlating peaks, the objective of the RF was to discriminate patients with negligible active inflammation from those with moderate/severe inflammation. Therefore, continuous CRP levels were binarised following the current WHO and FDA guidelines to classify patient bloods as either inflamed (CRP ≥ 5 mg/L) or non-inflamed (CRP < 5 mg/L) 15,16 . The machine learning (ML) approach consisted of three phases (Fig. 1). The first phase involved the use of an RF classifier and a fivefold cross-validated recursive feature elimination approach (RFE-CV) to identify the regions of the NMR spectrum contributing to the non/inflamed patient classification (feature selection). This step recursively excludes 1% of the 38,470 datapoints comprising the metabolic profiles, until all the features are removed, and identifies datapoints consistently important for classification. The resultant selected regions were then employed to generate the final fivefold cross-validated RF model. Averaged metrics collected to assess performance include the F-1 statistic, precision, recall and balanced accuracy. From this final cross-validated model, features ranked within the top 5th percentile of importance were retained for further analysis.
The selected points of the spectra were subsequently binned by their location on the NMR spectrum. Groups of ≥ 10 points observed in close proximity were defined as 'peaks' and their constituent variance summarised using PCA. The discriminatory power of each component in separating inflamed/non-inflamed patients was assessed using Wilcoxon rank sum test and p values adjusted using false discovery rate (FDR). Components with a corrected p value < 0.05 were combined by their sum, generating a single eigenvector for each peak. This process transformed the multiple points within each of the n peaks into a matrix of eigenvector scores for each patient (Fig. 1, phase II). Summing those components significantly discriminating the inflamed and non-inflamed classes after FDR correction allowed for integration into downstream analyses.
In phase III, the resulting eigenvector matrix was integrated with the GenePy-transformed genomic data. These steps resulted in two matrices for all patients summarising: (1) eigenvector scores representing the metabolomic data most discriminatory of inflammation status and; (2) genetic data summarising the pathogenic burden of mutation for each gene. Spearman's rank was used to correlate each of the metabolomic eigenvectors against GenePy scores for 15,854 genes. Genes with a nominally significant correlation were tested for enrichment in human databases (Gene Ontology, KEGG pathways, REACTOME, Complexes (CORUM), Human Phenotypes (HPA), WikiPathway (WP) using gProfiler2 17 . Enrichment scores (p values) were corrected using the SCS method embedded in the gProfiler2 model.

Results
Patients were recruited during routine clinics and untargeted with respect to diseases state, duration or treatment. As expected for paediatric Crohn's disease, our cohort is characterised by an excess of male patients. The cohort reflected heterogeneity expected within clinical service with respect to time since diagnosis, disease state and treatment. Retrospective interrogation of clinical records identified 30 children who had undergone (24-h liquid diet and 4-h) fasting in preparation for colonoscopy and 154 patients for whom same-day blood tests had been clinically requested (Table 1). Medication data were available for all patients, supplementary data 1. Twenty-seven of the 154 patients were on no medications, or only nutritional therapy, at the time of plasma sampling, Table 2.
Metabonomics. NMR-based blood-plasma metabolomic profiles were acquired for all 228 CD patients.
Multivariate analysis of these samples identified a subset of patients whose metabolome was markedly characterised by elevated concentrations of ketone bodies (3-hydroxybutyrate, acetone and acetoacetate, Supplementary  Fig. 1A). Cross referencing with clinical records revealed these patients had undergone bowel preparation for endoscopy procedure prior to the blood sampling used for NMR analyses. This metabolic perturbation was reflected in the OPLS model, which highlighted ketone bodies as strong discriminants, and overshadowed the importance of other discriminative peaks of the spectrum ( Supplementary Fig. 1A, B and C). All 30 patients with documented evidence of bowel preparation for endoscopy prior to plasma collection were therefore excluded from subsequent analyses. Metabolome was regressed against blood CRP readings for the 154 CD patients (R 2 Y = 0.63, Q 2 Y = 0.41; Fig. 2A). The corresponding loadings plot (Fig. 2B) highlights the peaks contributing most to that classification (weight > 0.4). Distinct signals (Table 3) associated with N-acetyl glycoprotein (δ 2.01-2.04 ppm), glycerol (δ 3.56, 3.64 ppm), phenylalanine (δ 7.33, 7.38, 7.43 ppm), and an unidentified lipid signal (δ 2.66 ppm) were identified at significantly higher concentration in plasma samples obtained from patients with higher systemic inflammation.
Machine learning classification. An RF model was employed to discriminate patient classes of actively inflamed (n = 58 patients with CRP ≥ 5 mg/L) versus uninflamed (n = 96 patients CRP < 5 mg/L) cases.
The first phase of modelling identified 23.1% percent (8934 datapoints) of the NMR spectrum as informative ( Supplementary Fig. 2). On average, the model trained and tested on this fraction of the spectrum was effective in distinguishing the non/inflamed classes (mean F-1 statistic = 0.78 ± 0.05; balanced accuracy = 0.82 ± 0.04; precision = 0.84 ± 0.08 and; recall = 0.74 ± 0.05). Figure 3A shows the regions of the spectrum identified by the RF model as most informative in discriminating patients with and without active inflammation.
PCA modelling of the subset of 258 points evaluated as having a relative importance measure within the top 5th percentile ( Fig. 3A; Supplementary Fig. 3) shows reasonable separation between patients according to their inflammation status (Fig. 3B). The distribution of 199 of these points was concentrated within 6 discrete peaks containing ≥ 10 supporting datapoints (Fig. 3C). These six NMR spectrum peaks identified as highly informative in the classification of CD patients with/out active inflammation ( www.nature.com/scientificreports/ defined by RF modelling recapitulate the OPLS findings of those peaks labelled as GlycA (δ 2.03-2.07 ppm); peak 4 defined by ML corresponds to one (δ 3.56-3.57 ppm) of the two spectral signatures that are noted in the OPLS modelling to depict glycerol; and peaks 5 and 6 in the RF model correspond to the two phenylalanine peaks (δ 7.20-7.26 ppm) as seen in Fig. 2B. Furthermore, the individual data points underlying ML derived peaks 3 and 4 are recognised as having the highest average discriminatory value for classification of inflamed status (average importance of 0.29 and 0.35 respectively) with peak 4 exhibiting the highest mean importance and also containing the single data point with the highest discriminative importance (Fig. 3C, Table 4).

Metabolomics-genomics integration.
Single eigenvectors summarising the six RF peaks significantly discriminating the inflamed and non-inflamed classes after FDR correction (Table 4, Supplementary Fig. 4, Supplementary Fig. 5).    www.nature.com/scientificreports/  www.nature.com/scientificreports/ Correlation of each patient's eigenvector scores for each of the six ML-defined metabolomic peaks against their GenePy gene scores was used to determine any relationship between metabolomic signatures of active inflammation and gene pathogenicity scores. This resulted in sets of nominally significant genes that were either positively or negatively correlated. In order to retain potentially informative biological insight, these genes were then grouped and assessed by direction of correlation (positive, negative, all). These gene sets were then interrogated for enrichment of specific functional pathways that might be useful in interpreting NMR peak signatures (Table 5).
Peak 1 is most significantly enriched for 'g-protein coupled receptors (GPCRs)' (WP:WP24; p = 0.004) when considering all correlated genes, although the signal remains significant when considering only genes that are negatively correlated with inflammation status. Specifically, patients designated as having non-inflamed status exhibit a higher burden of pathogenic variation in genes involved in GPCR signalling. Peak 2 is positively correlated with genes enriched to function within receptor complexes (p = 0.02) and regulate actin cytoskeleton (p = 0.03).
Metabolomic peak 4, identified by both OPLS and ML modelling to be most strongly associated with inflammation, contains the most significantly enriched functional gene-sets. One hundred and ten genes whose pathogenicity scores are significantly correlated with peak 4 are enriched to function within the 'intrinsic component of membrane' (p < 0.003; GO:0031224) and its subset-term 'integral component of membrane' (105 genes, p < 0.003; GO:0016021) (Supplementary Table 1). Interestingly, this is the same functional group identified as correlated with peak 3 suggesting a common biological mechanism might drive both metabolic signatures.
Of particular interest given our clinical cohort, is the set of seven genes correlated with peak 4 identified as enriching for molecular function in 'inflammatory bowel disease' (p < 0.003; KEGG:05321). This enrichment is specific to negatively correlating genes, indicating that CD patients with active inflammation are more likely to have a low burden of pathogenic variation within these genes. The seven nominally correlated genes that combine to define this IBD enrichment term are GATA3, IL12B, IL12RB2, IL6, MAF, NFKB1, RORC.
Peak 5 shows a distinct enrichment for the ESR (estrogen signalling receptor) signalling pathway (p < 0.005), an important molecular cascade involved in acute and chronic inflammation. Finally, peak 6 shows weaker enrichment for various enrichment terms many of which reflect of plasma membrane function echoed in peaks 3 and 4.

Discussion
This study combined untargeted metabolomics with whole gene pathogenicity burden scores derived from whole-exome sequencing data from paediatric patients diagnosed with Crohn's disease. Assimilation of clinical and omic data for patient samples modelled NMR spectra into discrete peaks strongly associated with active inflammation detected in plasma. Individual patient differences in these metabolomic signatures of inflammation appear non-random with respect to functional capacity of genes that elicit the pro-inflammatory immune response for which targeted therapies exist 18,19 .
We used two approaches to determine the regions of patient metabolic spectra most associated with inflammation. Results of both OPLS modelling of continuous CRP and RF modelling of binarised CRP levels, culminated to identify the same regions of the spectrum typically assigned to N-acetyl glycoproteins (GlycA), glycerol and phenylalanine. The metabololomic signature of inflammation identified in this study of paediatric CD patients is consistent with that identified in other studies of adult inflammation 20 . GlycA is a composite signal reflecting glycoprotein acetylation of heterogeneous origin 21,22 . Our data independently corroborate this signal as an NMR-derived spectrometric biomarker of systemic inflammation. The same signal was recently highlighted in the context of acute febrile illnesses, chronic inflammatory and autoimmune diseases and found to strongly correlate with CRP, interleukin-6, fibrinogen, serum amyloid A, lipoprotein-associated phospholipase A 2 and tumour necrosis factor 23,24 . While the data here presented reflects a single snapshot of patient's inflammation www.nature.com/scientificreports/ course, previous studies indicated how the GlycA signature might evolve over time 25 -yet with unknown dynamic with respect to CRP-but confirming its role in systemic inflammation 20 . CRP, produced by hepatocytes in response to IL-6, is a non-specific clinical marker of acute and chronic systemic inflammation. However, its efficacy as a single marker is limited by high inter-and intra-individual variability 26 . Although correlated, it has been suggested that the protein glycan biomarker GlycA and CRP may play distinct roles 27 . CRP levels increase in response to bacteria and intracellular antigens of damaged cells, as Table 5. Enrichment results of gene-peak correlations. Significant values are in [bold]. Enriched terms for genes that positively or negatively correlate with the identified peaks. a The term size indicates the number of genes belonging to a specific term in the relative dataset. b The intersection refers to the number of genes from the correlation analysis that overlaps with a specific term. c SCS correction method embedded in gProfiler2. d The complete list of genes enriching for the named term is reported in the Supplementary www.nature.com/scientificreports/ an early acute phase response, whereas haptoglobin, α 1 -acid glycoprotein, α 1 -antitrypsin and transferrin, that contribute the most to the GlycA signal, rise later stage of the inflammatory response 28 . GlycA measurement may represent an independent, more stable biomarker of acute response and systemic inflammation.
Regions of the metabolomic spectrum attributed to glycerol and phenylalanine were consistently associated with the inflammation in our CD patients. Phenylalanine is an aromatic amino acid previously linked to metabolic disturbance 29 and a marker of systemic low grade inflammation possibly arising from liver disfunction, compromised uptake at the blood brain barrier or altered microbiota composition 30 . Glycerol has recently been described as a single molecule systemic biomarker of infection whereby increased glycerol in plasma reflects a metabolic adaptation to intestinal infection, as a provision of sufficient energy for survival 31 . This study provides evidence for a correlation between the NMR glycerol signal and genes known to be involved in the pathogenesis of inflammatory bowel disease. Our data demonstrate increased levels of glycerol in patient's plasma negatively correlating with individual burden of pathogenic mutations in genes driving pro-inflammatory signalling i.e. patients with wild-type sequence across these genes exhibited a higher metabolic signature of inflammation suggesting a more intact and effective pro-inflammatory response. Despite our data modelling being blind to patient diagnosis, objective assessment of over fifteen thousand genes against the metabolomic signature of inflammation, 'inflammatory bowel disease' was amongst the most significantly enriched terms for correlated genes. The seven genes driving this result converge upon pro-inflammatory pathways and extensive data already support their role in IBD. The pathogenesis of Crohn's disease is multi-factorial, but there appears to be a significant proportion of patients where the underlying genetic risk is related to a hypo-immune response (such as loss-of function variants in NOD2) 30 . This concept provides a framework for understanding why low burden of variation in pro-inflammatory 'IBD' genes correspond to high glycerol levels. We hypothesise that in these maintained pro-inflammatory pathways, chronic activation occurs due to alternative hypo-immune response to intestinal bacteria, resulting in chronic inflammation and the observed hyper-inflammatory response 32,33 .
Expression of GATA3, a mediator of Th2 cytokine response to inflammation is dependent on the p50 subunit of NFKB encoded by NFKB1 34 . NFKB is activated by pattern-recognition receptors (PRRs) including the NOD-receptors and a master regulator of immune inflammation with an established role in perturbed mucosal inflammation in CD 35,36 . NFKB recruits several pro-inflammatory cytokines in response to microbial stimulation, including IL-12 and IL-23. IL6, in addition to promoting CRP, drives Th17 lineage development-plasticity of which is also influenced by RORC 37,38 . IL12B is an IBD-associated gene encoding the p40 subunit that is targeted by ustekinumab monoclonal antibody and common to both IL12 and IL23 39,40 . Functional studies in both mice 39 and human patients 41 proved how mutations in its coding sequence can alter the inflammatory response through the formation of the IL-12/IL-23 heterodimer. Although, IL12 and IL23 are both implicated in temporally distinct inflammatory responses to intestinal barrier impairment 42 , concurrent implication in our analysis of IL12RB that encodes the membrane receptor for the IL12 cytokine might suggest IL12 signalling is driving the inflammatory response in our paediatric cohort. Interestingly GlycA has been previously implicated as a tool for measuring inflammation, and specifically within IBD 43,44 . However this is a non-specific marker and the link to underlying genomic variation requires further investigation.
Future optimisation of the approach applied herein is possible. CRP levels are transient and fluctuate with disease state and treatment. Other than their attendance at routine tertiary clinics, our patients were unselected with respect to their disease state or clinical intervention. It is likely that standardising the patient cohort would further improve power to detect genetic signals. Superior power may be gained by focussing on treatment naïve individuals at point of diagnosis, although such samples can be difficult to attain and remain non-uniform with respect to underlying genetics, steroid and antibiotic use and duration of disease prior to first attendance.
In preparation for endoscopy, patients are restricted to a glucose-containing fluid-only diet from 24-h prior to the procedure and nil-by-mouth for the four hours immediately preceding endoscopy. Our data identified a metabolic signature highly inflated for ketone bodies in these patients that may warrant further clinical consideration.
Our data indicate patients with an altered burden of pathogenic mutation within genes critical to mounting the pro-inflammatory immune response following bacterial exposure, harbour a distinctive metabolomic signature (δ 3.56-3.57 ppm) reflecting inflammatorily active disease. This metabolomic signature and its correlated genes warrant further investigation as biomarkers to stratify CD patients into groups that may respond differently to targeted monoclonal antibodies. While our study focussed on children with a diagnosis of Crohn's disease, we suggest mutations in these genes are unlikely to represent the primary CD disease trigger in many of these patients, but instead contribute to an individual genomic profile that substantially modulates the inflammatory response and disease progression.

Data availability
The datasets generated and/or analysed during the current study are available through direct collaborative agreements, in line with the informed consent gained from all participants. www.nature.com/scientificreports/